# PURPOSE

This directory contains the code for the paper, CoFrGeNet: Continued Fraction Architectures for Language Generation.


# SETUP
This repo is a modification of nanoGPT by Andrej Karpathy. Directions for installation, training, and inference are given below. First, however, we describe the main changes relative to nanoGPT and how to instantiate the CoFrNet replacements for the standard attention and MLP blocks in the GPT models in order to run our architecture.

- model.py: This is the main file to modify in order to run different architectures. The paper proposes three architectures, which are selected by editing the GPTConfig class (line 621). To make the attention use the continuants version of CoFrGeNet, set the `attn` variable to `cofrC_arch3`. To make the MLP layer use CoFrGeNet, set the `mlp` variable to `cofrC`. Similarly, adjust the layer depth and width through the parameters `attn_Lower_depth`, `mlp_lower_depth`, and `mlpC_width`.

- utils_CoFrNet.py: This file contains the structure of the different CoFrNet architectures (e.g., diagonalized, ladder-of-ladders).
- Customized_Linear_Classes.py: This file implements the forward and backward passes for the different CoFrNet architectures.
- CoFrNet_continuants.py: This file implements the continuants version of the CoFrNet architecture.
- config.py: This file holds the configuration parameters, such as the output directory and the dataset.
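
The edits to GPTConfig described above can be pictured with the following minimal sketch. It is a hypothetical stand-in, not the class in model.py: only the five field names and the values `cofrC_arch3` and `cofrC` come from this README, while the class name `GPTConfigSketch`, the default strings, and the integer placeholders are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class GPTConfigSketch:
    """Hypothetical stand-in for GPTConfig in model.py (defaults are assumed)."""

    attn: str = "standard"      # assumed placeholder for the default attention block
    mlp: str = "standard"       # assumed placeholder for the default MLP block
    attn_Lower_depth: int = 1   # placeholder value; see model.py for the real default
    mlp_lower_depth: int = 1    # placeholder value; see model.py for the real default
    mlpC_width: int = 1         # placeholder value; see model.py for the real default


# Select the continuants attention and the CoFrGeNet MLP, as described above.
cofr_config = replace(GPTConfigSketch(), attn="cofrC_arch3", mlp="cofrC")
print(cofr_config)
```

In the repo itself the same effect is obtained by editing the corresponding fields of GPTConfig directly.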

## Install
```
pip install torch numpy transformers datasets tiktoken wandb tqdm 
```

Dependencies:

- [pytorch](https://pytorch.org) <3
- [numpy](https://numpy.org/install/) <3
-  `transformers` for huggingface transformers <3 (to load GPT-2 checkpoints)
-  `datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
-  `tiktoken` for OpenAI's fast BPE code <3
-  `wandb` for optional logging <3
-  `tqdm` for progress bars <3


## Data preparation

Prepare the OpenWebText dataset (https://openwebtext2.readthedocs.io/en/latest/) by running:

```sh
python data/openwebtext/prepare.py
```

This downloads and tokenizes the [OpenWebText](https://huggingface.co/datasets/openwebtext) dataset. It creates `train.bin` and `val.bin`, each of which holds the GPT-2 BPE token ids as one sequence, stored as raw uint16 bytes.
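
To make the "raw uint16 bytes" layout concrete, here is a small sketch that writes a few token ids in that format and reads them back. It uses only the Python standard library so that it runs without the dependencies above; the helper names `write_tokens` and `read_tokens`, the file name `demo.bin`, and the three ids are illustrative assumptions, not part of the repo.

```python
import os
import tempfile
from array import array


def write_tokens(path, token_ids):
    """Write token ids as a flat sequence of native-endian unsigned 16-bit values."""
    buf = array("H", token_ids)
    assert buf.itemsize == 2, "this sketch assumes a 2-byte unsigned short"
    with open(path, "wb") as f:
        buf.tofile(f)


def read_tokens(path):
    """Read the whole file back as a list of token ids."""
    buf = array("H")
    n_tokens = os.path.getsize(path) // buf.itemsize
    with open(path, "rb") as f:
        buf.fromfile(f, n_tokens)
    return list(buf)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")   # hypothetical file name
    demo_ids = [15496, 995, 50256]              # illustrative ids within the GPT-2 vocabulary
    write_tokens(demo_path, demo_ids)
    round_trip = read_tokens(demo_path)
    file_size = os.path.getsize(demo_path)      # 2 bytes per token

print(round_trip, file_size)
```

The GPT-2 vocabulary has 50257 entries, so every id fits in 16 bits. For the full `train.bin`, which is too large to load as a Python list, a memory-mapped view such as `np.memmap(path, dtype=np.uint16, mode="r")` is the more practical choice.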

## Training

To train the model on a node with 8 A100 40GB GPUs:

```sh
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
```

## Sampling / inference

To sample from a trained model, pass its output directory via `--out_dir`:

```sh
python sample.py --out_dir=out-dir
```

# GneissWeb

The GneissWeb dataset can be downloaded from https://huggingface.co/datasets/ibm-granite/GneissWeb
Because of space requirements, we do not provide the subset of the data that we used for training. It will be released as a Hugging Face dataset on acceptance.

## Acknowledgements

This code is adapted from the nanoGPT repo by Andrej Karpathy.
